Introduction

In this report, the goal is to build on a spatial risk prediction model and create a tool that police can use to allocate their limited resources more efficiently. Our hypothesis is that risk is a function of exposure to the kinds of environmental variables we think may be associated with crime. We consider not only ‘risk factors’ but ‘protective factors’ as well; examples of the latter include churches and recreation centers.

To better predict this risk, the model used in this report is Poisson regression. The report is divided into five parts:

  • Data wrangling
  • Exploratory analysis
  • Estimate models
  • Goodness of fit
  • Interpretation
library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(FNN)
library(ggmap)
library(ggplot2)
library(RColorBrewer)
library(kableExtra)
library(leaflet)
options(scipen=999)
setwd("/Users/wangzixuan/Desktop/Fri. Ken/week12")

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

nn_function <- function(measureFrom,measureTo,k) {
  
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint)
  
  return(output)  
}
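Before using nn_function, here is a quick sanity check on toy coordinates (hypothetical data, not part of the analysis) showing what it returns: one value per origin point, the mean distance to its k nearest destination points.

```r
# Toy example: three origin points and four destination points.
from <- matrix(c(0, 0,
                 1, 0,
                 0, 1), ncol = 2, byrow = TRUE)
to   <- matrix(c(0, 0,
                 2, 0,
                 0, 2,
                 2, 2), ncol = 2, byrow = TRUE)
nn_function(from, to, 2)  # pointDistance is 1 for every origin in this layout
```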

1: Data wrangling

Our hypothesis is that risk is a function of exposure to the kinds of environmental variables we think may be associated with crime. Therefore, the first section is all about downloading, wrangling, and feature engineering the relevant datasets to create more useful features.

1.1: Dependent Variables

Our dependent variable for this study is assault crimes in 2017. We can easily capture this dataset from the Chicago Open Data Portal and plot the spatial location of the assault points over the police beats of the City of Chicago.

crimePoints <- read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr")
crimePoints <- 
  crimePoints %>%
  filter(Primary.Type == "ASSAULT") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y))

policeBeats <- st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON")

ggplot() + 
  geom_sf(data=st_union(policeBeats), aes(), fill='black', colour=NA) +
  geom_point(data=crimePoints, aes(X,Y), colour="yellow", size=0.1) +
  labs(title= "Assaults, Chicago - 2017") +
  mapTheme()

With so many points, the spatial distribution cannot be clearly distinguished. To get a better sense of the spatial distribution, we can aggregate the point data to a “fishnet” - a lattice grid of polygons that we will use to define “place” in the analysis.

policeBeats <- st_transform(st_union(policeBeats), crs=102271)

fishnet <- 
  st_make_grid(st_union(policeBeats), cellsize = 500) %>%
  st_intersection(policeBeats, .) %>%
  st_sf() %>%
  mutate(uniqueID = rownames(.))

crimePoints.sf <-
  crimePoints %>%
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet))

crime_net <- 
  crimePoints.sf %>% 
  dplyr::select() %>% 
  mutate(countAssaults = 1) %>% 
  aggregate(., fishnet, length) %>%
  mutate(countAssaults = ifelse(is.na(countAssaults), 0, countAssaults),
         uniqueID = rownames(.))

ggplot() + 
  geom_sf(data=crime_net, aes(fill=countAssaults)) + 
  scale_fill_viridis() +
  labs(title= "Number of Assaults by fishnet, Chicago - 2017",
       subtitle=" 500m x 500m") +
  mapTheme()

1.2: Independent Variables

I. Environment features

Abandoned_Buildings

abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
  mutate(year = substr(DATE.SERVICE.REQUEST.WAS.RECEIVED,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(LATITUDE,LONGITUDE, Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Abandoned_Buildings")

Liquor_Retail

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Liquor-and-Public-Places/nrmj-3kcf") %>%
  filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
  dplyr::select(LATITUDE,LONGITUDE, Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")

Street_Lights_Out
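The wrangling code for this dataset is missing here, although streetLightsOut is used below when binding allVars. The following is a sketch that follows the same pattern as the other 311 datasets; the Socrata URL is an assumption and should be verified against the Chicago Open Data Portal.

```r
# Hedged sketch: 311 street-lights-out reports for 2017, wrangled the same
# way as the other 311 feature datasets. The URL is an assumption -- verify it.
streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
  mutate(year = substr(Creation.Date,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(Latitude,Longitude, Y = Latitude, X = Longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Street_Lights_Out")
```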

rodent: 311 rodent baiting reports

rodent<-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-No-Duplicates/uqhs-j723") %>%
  mutate(year = substr(Creation.Date,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(Latitude,Longitude, Y = Latitude, X = Longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "rodent")

Now, wrangle all the environmental independent variables into one dataset.

allVars <- 
  rbind(streetLightsOut,abandonBuildings,liquorRetail,rodent) %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  as.data.frame()

ggplot() +
  geom_sf(data=st_union(policeBeats)) +
  geom_point(data = allVars, aes(X,Y), size = .08) +
  facet_wrap(~Legend) + 
  labs(title= "Scatterplot of independent variables") +
  mapTheme()

Spatial join each independent variable dataset to the fishnet so that we can calculate the count of each variable in each grid cell. Now we can map each variable at the same spatial scale.

allVars.sf <- 
  rbind(rodent,streetLightsOut,abandonBuildings,liquorRetail) 

vars_net <- 
  st_join(allVars.sf, fishnet, join=st_within) %>%
  as.data.frame() %>%
  group_by(uniqueID,Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(Legend,count, fill=0) %>%
  st_sf() %>%
  na.omit()

vars_net %>%
  gather(key,value,-geometry, -uniqueID) %>%
  mutate(value = ifelse(is.na(value),0,value)) %>%
  filter(key != "<NA>") %>%
  ggplot(data=.) +
  geom_sf(aes(fill=value), colour=NA) +
  facet_wrap(~key) + 
  labs(title= "Counts of independent variables by fish net",
        subtitle="500m x 500m") +
  mapTheme() +
  scale_fill_viridis()

Since each dataset has a different range, the shared color scale makes some variables hard to read here; the scatterplots above can be more helpful for identifying whether spatial clusters exist in each independent variable.

II. Spatial structure

In order to compute average nearest neighbor distances, it is easier to recode the fishnet cells as points. First, use st_centroid to do so, then output a matrix, which the nearest neighbor function requires.

vars_net.xy <- 
  st_centroid(vars_net) %>%
  cbind(.,st_coordinates(st_centroid(vars_net))) %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

distSchools: average distance to two nearest schools

schools <- 
  st_read("https://data.cityofchicago.org/api/geospatial/3fhj-xtn5?method=export&format=GeoJSON") %>%
  st_transform(st_crs(fishnet))

schools.xy <- 
  schools %>%
  cbind(.,st_coordinates(st_centroid(schools)))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

distSchools <- as.data.frame(nn_function(vars_net.xy,schools.xy,2)) 

vars_net <- 
  cbind(as.data.frame(vars_net),distSchools) %>%
  st_sf() %>%
  rename(schoolsDistance=pointDistance)

ggplot() +
  geom_sf(data=vars_net,aes(fill=schoolsDistance)) +
  geom_point(data=data.frame(schools.xy), aes(X,Y),size=0.3,alpha=0.5,color="pink") +
  scale_fill_viridis() +
  mapTheme() +
  labs(title="Average nearest neighbor distance to schools",
       subtitle="k=2 ",
       legend="distance")

distclinic: average distance to two nearest clinics

clinic <- 
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Chicago-Department-of-Public-Health-Clinic-Locatio/4msa-kt5t") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y))%>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "clinic")

clinic.xy <- 
  clinic %>%
  cbind(.,st_coordinates(st_centroid(clinic)))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

distclinic <- as.data.frame(nn_function(vars_net.xy,clinic.xy,2)) 
vars_net <- 
  cbind(as.data.frame(vars_net),distclinic) %>%
  st_sf() %>%
  rename(distclinic=pointDistance)

ggplot() +
  geom_sf(data=vars_net,aes(fill=distclinic)) +
  geom_point(data=data.frame(clinic.xy), aes(X,Y),size=0.5,alpha=0.5,color="green") +
  scale_fill_viridis() +
  mapTheme() +
  labs(title="Average nearest neighbor distance to clinics",
       subtitle="k=2 ")

distpolice: average distance to 3 nearest police stations

police_sta <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Police-Stations-Map/gkur-vufi") %>%
  mutate(x = gsub("[()]", "", LOCATION)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y))%>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "police_sta")

police_sta.xy <- 
  police_sta %>%
  cbind(.,st_coordinates(st_centroid(police_sta)))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

distpolice<- as.data.frame(nn_function(vars_net.xy,police_sta.xy,3)) 

vars_net <- 
  cbind(as.data.frame(vars_net),distpolice) %>%
  st_sf() %>%
  rename(distpolice=pointDistance)

ggplot() +
  geom_sf(data=vars_net,aes(fill=distpolice)) +
  geom_point(data=data.frame(police_sta.xy), aes(X,Y),size=0.5,alpha=0.5,color="grey") +
  scale_fill_viridis() +
  mapTheme() +
  labs(title="Average nearest neighbor distance to police stations",
       subtitle="k=3")

distlandmarks: average distance to 3 nearest landmarks

landmarks <- 
  read.socrata("https://data.cityofchicago.org/Historic-Preservation/Individual-Landmarks-Map/habu-n236") %>%
  dplyr::select(LATITUDE,LONGITUDE, Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "landmarks")

land.xy <- 
  landmarks %>%
  cbind(.,st_coordinates(st_centroid(landmarks)))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

distlandmarks <- as.data.frame(nn_function(vars_net.xy,land.xy,3)) 

vars_net <- 
  cbind(as.data.frame(vars_net),distlandmarks) %>%
  st_sf() %>%
  rename(distlandmarks=pointDistance)

ggplot() +
  geom_sf(data=vars_net,aes(fill=distlandmarks)) +
  geom_point(data=data.frame(land.xy), aes(X,Y),size=0.3,alpha=0.5,color="orange") +
  scale_fill_viridis() +
  mapTheme() +
  labs(title="Average nearest neighbor distance to landmarks",
       subtitle="k=3")

loopDistance: distance to the Loop district

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

loopPoint <-
  neighborhoods %>%
  filter(name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

ggplot() +
  geom_sf(data=vars_net,aes(fill=loopDistance)) +
  scale_fill_viridis() +
  mapTheme() +
  labs(title="Distance to the loop district")

dist_Sanitation: average distance to the 5 nearest 311 sanitation reports

Sanitation<- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints/me59-5fac") %>%
  mutate(year = substr(Creation.Date,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(Latitude,Longitude, Y = Latitude, X = Longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation")

Sanitation.xy <- 
  Sanitation %>%
  cbind(.,st_coordinates(st_centroid(Sanitation)))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

dist_Sanitation<- as.data.frame(nn_function(vars_net.xy,Sanitation.xy,5))

vars_net <- 
  cbind(as.data.frame(vars_net),dist_Sanitation) %>%
  st_sf() %>%
  rename(dist_Sanitation=pointDistance)

ggplot() +
  geom_sf(data=vars_net,aes(fill=dist_Sanitation)) +
  scale_fill_viridis() +
  mapTheme() +
  labs(title="Average nearest neighbor distance to 311 sanitation report",
       subtitle="k=5")

dist_potholes: average distance to the 5 nearest 311 pothole reports

potholes<-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Pot-Holes-Reported/7as2-ds3y") %>%
  mutate(year = substr(CREATION.DATE,1,4)) %>%
  filter(year == "2017") %>%
  dplyr::select(LATITUDE,LONGITUDE, Y = LATITUDE, X = LONGITUDE) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_sf() %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "potholes")

potholes.xy <- 
  potholes %>%
  cbind(.,st_coordinates(st_centroid(potholes)))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

dist_potholes<- as.data.frame(nn_function(vars_net.xy,potholes.xy,5)) 

vars_net <- 
  cbind(as.data.frame(vars_net),dist_potholes) %>%
  st_sf() %>%
  rename(dist_potholes=pointDistance)

ggplot() +
  geom_sf(data=vars_net,aes(fill=dist_potholes)) +
  scale_fill_viridis() +
  mapTheme() +
  labs(title="Average nearest neighbor distance to 311 potholes report",
       subtitle="k=5")

spatiallag: average distance to the 10 nearest other fishnet cell centroids (k=11 because each cell includes itself at distance 0)

spatiallag <- as.data.frame(nn_function(vars_net.xy,vars_net.xy,11)) 

vars_net <- 
  cbind(as.data.frame(vars_net),spatiallag) %>%
  st_sf() %>%
  rename(spatialdis=pointDistance)

1.3: Join the crime data to the variable data

Finally, we join the crime_net and vars_net fishnets into our final dataset, final_net.

final_net <-
  left_join(crime_net,data.frame(vars_net), by="uniqueID") %>%
  dplyr::select(-geometry,-geometry.y)
  
head(final_net)
## Simple feature collection with 6 features and 15 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 359653.4 ymin: 552875.4 xmax: 362577.3 ymax: 553375.4
## epsg (SRID):    102271
## proj4string:    +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
##   countAssaults uniqueID Abandoned_Buildings Liquor_Retail rodent
## 1             5        1                   1             0      1
## 2             0        2                   0             0      0
## 3             0        3                   0             0      0
## 4             0        4                   0             0      0
## 5             0        5                   0             0      0
## 6             0        6                   0             0      0
##   Street_Lights_Out X.NA. schoolsDistance distclinic distpolice
## 1                 5     0        936.1086   12730.32   6551.199
## 2                 0     0        856.9091   12660.57   6567.802
## 3                 0     0        744.8698   12601.64   6615.303
## 4                 0     0        800.7048   12561.48   6694.384
## 5                 0     0       1046.2735   12537.04   6801.183
## 6                 0     0       1284.9225   12531.22   6938.260
##   distlandmarks loopDistance dist_Sanitation dist_potholes spatialdis
## 1      4096.695     25923.34        327.2810      275.6396   818.7367
## 2      4173.082     25947.48        613.5621      490.4322   744.4533
## 3      4225.645     25982.62        815.8853      567.2197   734.0936
## 4      4239.686     26028.00       1163.0317      698.2375   737.0627
## 5      4286.523     26079.69       1570.5669      913.6268   734.8873
## 6      4367.425     26140.79       2008.1002     1114.9254   732.5716
##                       geometry.x
## 1 POLYGON ((360077.3 552880.7...
## 2 POLYGON ((360577.3 552885.6...
## 3 POLYGON ((361077.3 552890.3...
## 4 POLYGON ((361577.3 552894.4...
## 5 POLYGON ((362077.3 552909.8...
## 6 POLYGON ((362577.3 552912.7...

2: Exploratory analysis

2.1: Normality Test

As can be seen from the histogram of the dependent variable below, a large share of grid cells have a countAssaults value of 0.

ggplot(final_net, aes(countAssaults)) + geom_histogram(binwidth = 1)

We cannot take the log of 0, and we do not want to drop the majority of the dataset just because it is zero; on top of that, taking the log of countAssaults does not make the distribution normal.
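A quick console check illustrates the problem with zeros:

```r
log(0)      # -Inf: cells with zero assaults cannot be log-transformed
log(0 + 1)  # 0: a log(y + 1) shift avoids -Inf, but the skew remains
```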

ggplot(crime_net, aes(log(countAssaults))) + geom_histogram(binwidth = 1)

The distribution of crime counts is highly skewed because crime, relatively speaking, is a rare event. In this case we need to use Poisson regression.
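As a minimal illustration on simulated toy data (not the report's dataset), Poisson regression models the log of the expected count as a linear function of the predictors:

```r
# Simulate counts whose log-mean is linear in x, then recover the
# coefficients with glm(family = "poisson").
set.seed(1)
toy <- data.frame(x = runif(500, 0, 10))
toy$count <- rpois(500, lambda = exp(0.2 + 0.15 * toy$x))
toyReg <- glm(count ~ x, family = "poisson", data = toy)
coef(toyReg)  # estimates should land near the true 0.2 and 0.15
```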

2.2: Relationship between dependent variable and the independent variables

The scatterplots below show the relationship between the dependent variable and each independent variable.

final_net333 <- 
final_net %>%
  gather(key,value,-geometry.x,-uniqueID,-countAssaults) %>%
  mutate(value = ifelse(is.na(value),0,value)) %>%
  filter(key != "X.NA.")

ggplot(final_net333, aes(x=value, y=countAssaults)) +
  geom_point(size=1, shape=21,col="black")+theme_bw()+
  stat_smooth(method = "lm", se = FALSE, size =0.5, colour="red")+ 
  facet_wrap(~key)+
  labs(title="Relationship between dependent variable and independent variables")

3: Estimate models

3.1: Standardized regression coefficients plot

As can be seen from the figure below, the distance to the sanitation reports has the greatest impact on the model, followed by the distance to schools.

reg <- glm(countAssaults ~ ., family = "poisson", 
            data= final_net %>% 
              as.data.frame %>% 
              dplyr::select(countAssaults, 
                            Abandoned_Buildings,
                            Liquor_Retail,
                            Street_Lights_Out,
                            schoolsDistance,
                            loopDistance,
                            distclinic,
                            distpolice,
                            rodent,
                            distlandmarks,
                            spatialdis,
                            dist_Sanitation,
                            dist_potholes
              ))
summary(reg)
standardized <- as.data.frame(lm.beta(reg))
standardized$variable <- row.names(standardized)
colnames(standardized)[1] <- "std_coefficient"
standardized$absCoef <- abs(standardized$std_coefficient)

ggplot(standardized, aes(x=reorder(variable,-absCoef), y=absCoef, fill=variable)) + geom_bar(stat="identity")

From the range result below, the highest-risk cells have a predicted assault rate of just over 100, which is much higher than the observed maximum count (88).

range(reg$fitted.values)
## [1]   0.0000000001899858 102.5873507562991733
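The observed maximum can be checked directly against the fitted range (assuming final_net as constructed above):

```r
max(final_net$countAssaults)  # observed maximum assault count per cell
```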

3.2: Average predicted risk by fishnet

ggplot() +
  geom_sf(data=cbind(fishnet,reg$fitted.values), aes(fill=reg.fitted.values), colour=NA) +
  labs(title= "Predicted Assault risk by fishnet") +
  scale_fill_viridis() +
  mapTheme()

This risk map shows the risk level of each fishnet cell in Chicago. The high-risk areas tend to cluster in space: the first cluster appears in the Loop district, Chicago's central business district, and the second cluster is in the center of the city.

3.3: Average predicted risk by police beat

final_net<-cbind(final_net,reg$fitted.values) %>%
  mutate(error = abs(countAssaults - reg.fitted.values))

newpoliceBeats <- st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON")

newpoliceBeats <- 
  newpoliceBeats %>%st_transform(st_crs(final_net)) %>% st_sf() %>% na.omit()

final_police <- 
  st_join(final_net, newpoliceBeats, join=st_intersects) %>%
  as.data.frame() %>%
  group_by(beat_num) %>%
  summarize(value=mean(reg.fitted.values)) %>%
  full_join(newpoliceBeats) %>%
  st_sf() %>%
  na.omit()

ggplot() +
  geom_sf(data=final_police, aes(fill=value), colour=NA) +
  labs(title= "Predicted Assault risk by police beat") +
  scale_fill_viridis() +
  mapTheme()

This map aggregates the predicted risk values by police beat, and it can be given to the officers working each beat.

4: Goodness of fit

4.1: Output MAE and MAPE

mean(final_net$error)
## [1] 4.160044

The MAE of this model is 4.16, meaning the prediction for each fishnet cell is off by about 4 assaults on average, which suggests that this model is not a good fit.
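The heading also promises MAPE. Because many cells have zero observed assaults, dividing by the observed count is undefined there; one hedged option is to compute MAPE only over cells with at least one observed assault (final_net and reg.fitted.values come from section 3.3 above).

```r
# Hedged MAPE sketch: restrict to cells with at least one observed assault
# to avoid dividing by zero; this is one of several reasonable conventions.
nonzero <- filter(final_net, countAssaults > 0)
mean(abs(nonzero$countAssaults - nonzero$reg.fitted.values) /
       nonzero$countAssaults)
```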

4.2: Neighborhoods extent predicted risk map

neighborhoods <- 
  neighborhoods %>%st_transform(st_crs(final_net)) %>% st_sf() %>% na.omit()

final_neighborhoods <- 
  st_join(final_net, neighborhoods, join=st_intersects) %>%
  as.data.frame() %>%
  group_by(name) %>%
  summarize(value=mean(reg.fitted.values)) %>%
  full_join(neighborhoods) %>%
  st_sf() %>%
  na.omit()

ggplot() +
  geom_sf(data=final_neighborhoods, aes(fill=value), colour=NA) +
  labs(title= "Predicted Assault risk by neighborhood") +
  scale_fill_viridis() +
  mapTheme()

Compared with the two prediction maps above, this model is clearly biased at the neighborhood extent. Risk values within the same neighborhood can be quite different, and aggregating at this extent can amplify the influence of surrounding areas, leading to much higher risk values.

4.3: Kernel Density vs. Risk Prediction Plot

In order to assess goodness of fit, we need to find out whether the model targets crime better than the Kernel Density approach in ArcGIS.

countComparisons <- read.csv("countComparisons.txt")

countComparisons <- 
  countComparisons %>%
  dplyr::select(gridcode, KernelCnt, fittedCnt)

countComparisons <- cbind(
  countComparisons,
  data.frame(Category = c("90% - 100%", "70% - 89%", "50% - 69%", "30% - 49%", "1% - 29%")))

countComparisons <- 
  countComparisons %>%
  dplyr::mutate(kernelPct = KernelCnt / sum(KernelCnt),
                fittedPct = fittedCnt / sum(fittedCnt))
countComparisons
##   gridcode KernelCnt fittedCnt   Category  kernelPct  fittedPct
## 1        1      5231       750 90% - 100% 0.27417580 0.03931024
## 2        2      4129      1943  70% - 89% 0.21641595 0.10183972
## 3        3      4851      3828  50% - 69% 0.25425861 0.20063945
## 4        4      3400      6584  30% - 49% 0.17820640 0.34509146
## 5        5      1468      5974   1% - 29% 0.07694324 0.31311914

Now we plot as a grouped bar plot.

countComparisonsLong <-
  countComparisons %>% 
  gather(Variable, Value, kernelPct:fittedPct)

ggplot(data=countComparisonsLong, aes(Category,Value)) +
  geom_bar(aes(fill = Variable), position = "dodge", stat="identity") +
  scale_fill_manual(values = c("red", "blue"),
                    labels=c("Risk Prediction", "Kernel Density")) +
  labs(x= "Predicted Risk Levels",
       y="Percent of correctly predicted cases",
       title= "Goodness of Fit: Risk Prediction vs. Kernel Density hotspot")

5: Interpretation

As can be seen from the predicted risk maps above, this model is clearly biased. The neighborhood extent in particular can distort the result: some lower-risk blocks that sit close to higher-risk blocks can be predicted as riskier than they actually are.

From the bar plot above, we can tell that when the predicted risk level is above 50%, the Kernel Density approach captures a higher percentage of observed assaults than the risk prediction; only at the lower risk levels does the risk prediction do better. Thus, this Poisson regression model is certainly not a good fit. For further study, adding more internal features such as demographics may give the model a better goodness of fit.